Abstract

The CAP theorem is a fundamental result that applies to distributed storage systems. 
In this paper, we first present and prove two CAP-like impossibility theorems. To state 
these theorems, we present probabilistic models to characterize the three important elements
of the CAP theorem: consistency (C), availability or latency (A), and partition
tolerance (P). The theorems show the un-achievable envelope, i.e., which combinations 
of the parameters of the three models make them impossible to achieve together. Next, 
we present the design of a class of systems called PCAP that perform close to the envelope
described by our theorems. In addition, these systems allow applications running
in a single data-center to specify either a latency SLA or a consistency SLA. The PCAP
systems automatically adapt, in real-time and under changing network conditions, to 
meet the SLA while optimizing the other C/A metric. We incorporate PCAP into two 
popular key-value stores - Apache Cassandra and Riak. Our experiments with these 
two deployments, under realistic workloads, reveal that the PCAP system satisfactorily 
meets SLAs, and performs close to the achievable envelope. We also extend PCAP from 
a single data-center to multiple geo-distributed data-centers. 
1 Introduction


Storage systems form the foundational platform for modern Internet services such as Web search, 
analytics, and social networking. Ever increasing user bases and massive data sets have forced 
users and applications to forgo conventional relational databases, and move towards a new class of 
scalable storage systems known as NoSQL key-value stores. Many of these distributed key-value 
stores (e.g., Cassandra [39], Riak [M], Dynamo [2T], Voldemort [15]) support a simple GET/PUT
interface for accessing and updating data items. The data items are replicated at multiple servers
for fault tolerance. In addition, they offer a very weak notion of consistency known as eventual
consistency, which, roughly speaking, says that if no further updates are sent to a given data
item, all replicas will eventually hold the same value.

These key-value stores are preferred by applications for whom eventual consistency suffices, but 
where high availability and low latency (i.e., fast reads and writes [T]) are paramount. Latency 
is a critical metric for such cloud services because latency is correlated to user satisfaction - for 
instance, a 500 ms increase in latency for operations at Google.com can cause a 20% drop in 
revenue [52]. At Amazon, this translates to a $6M yearly loss per added millisecond of latency.
This correlation between delay and lost retention is fundamentally human. Humans suffer from 
a phenomenon called user cognitive drift, wherein if more than a second (or so) elapses between 
clicking on something and receiving a response, the user’s mind is already elsewhere. 

At the same time, clients in such applications expect freshness, i.e., data returned by a read 
to a key should come from the latest writes done to that key by any client. For instance, Netflix 
uses Cassandra to track positions in each video [14], and freshness of data translates to accurate
tracking and user satisfaction. This implies that clients care about a time-based notion of data 
freshness. Thus, this paper focuses on consistency based on the notion of data freshness (as defined 
later). 

The CAP theorem was proposed by Eric Brewer, and later formally proved by Gilbert
and Lynch [2S] [27]. It essentially states that a system can choose at most two of three desirable
properties: Consistency (C), Availability (A), and Partition tolerance (P). Recently, Abadi [T]
proposed to study the consistency-latency tradeoff, and unified the tradeoff with the CAP theorem.
The unified result is called PACELC. It states that when a network partition occurs, one needs to
choose between Availability and Consistency; otherwise, the choice is between Latency and
Consistency. We focus on the latter tradeoff as it is the common case. These prior results provided
a qualitative characterization of the tradeoff between consistency and availability/latency, while we
provide a quantitative characterization of the tradeoff.

Concretely, traditional CAP literature tends to focus on situations where “hard” network
partitions occur and the designer has to choose between C and A, e.g., in geo-distributed data-centers.
However, individual data-centers themselves suffer far more frequently from “soft” partitions [2U],
arising from periods of elevated message delays or loss rates (i.e., the “otherwise” part of PACELC)
within a data-center. Neither the original CAP theorem nor the existing work on consistency in
key-value stores [59] addresses such soft partitions for a single
data-center.

In this paper we state and prove two CAP-like impossibility theorems. To state these theorems,
we present probabilistic^1 models to characterize the three important elements: soft partition,
latency requirements, and consistency requirements. All our models take timeliness into account. Our
latency model specifies soft bounds on operation latencies, as might be provided by the application
in an SLA (Service Level Agreement). Our consistency model captures the notion of data freshness
returned by read operations. Our partition model describes propagation delays in the underlying
network. The resulting theorems show the un-achievable envelope, i.e., which combinations of the
parameters in these three models (partition, latency, consistency) make them impossible to achieve
together. Note that the focus of the paper is neither defining a new consistency model nor
comparing different types of consistency models. Instead, we are interested in the un-achievable envelope
of the three important elements and measuring how close a system can perform to this envelope.

^1 By probabilistic, we mean the behavior is statistical over a long time period.

Next, we describe the design of a class of systems called PCAP (short for Probabilistic CAP) 
that perform close to the envelope described by our theorems. In addition, these systems allow 
applications running inside a single data-center to specify either a probabilistic latency SLA or a 
probabilistic consistency SLA. Given a probabilistic latency SLA, PCAP’s adaptive techniques meet 
the specified operational latency requirement, while optimizing the consistency achieved. Similarly, 
given a probabilistic consistency SLA, PCAP meets the consistency requirement while optimizing 
operational latency. PCAP does so under real and continuously changing network conditions.
There are known use cases that would benefit from a latency SLA - these include the Netflix
video tracking application [14], online advertising, and shopping cart applications [58] - each
of these needs fast response times but is willing to tolerate some staleness. A known use case
for a consistency SLA is a Web search application [58], which desires search results with bounded
staleness but would like to minimize the response time. While the PCAP system can be used with a
variety of consistency and latency models (like PBS [7]), we use our PCAP models for concreteness.

We have integrated our PCAP system into two key-value stores - Apache Cassandra [39] and
Riak [31]. Our experiments with these two deployments, using YCSB benchmarks, reveal that
PCAP systems satisfactorily meet a latency SLA (or consistency SLA), optimize the consistency
metric (respectively latency metric), perform reasonably close to the envelope described by our
theorems, and scale well.

We also extend PCAP from a single data-center to multiple geo-distributed data-centers. The 
key contribution of our second system (which we call GeoPCAP) is a set of rules for composing
probabilistic consistency/latency models from across multiple data-centers in order to derive the
global consistency-latency tradeoff behavior. Realistic simulations demonstrate that GeoPCAP can
satisfactorily meet consistency/latency SLAs for applications interacting with multiple data-centers, 
while optimizing the other metric. 


2 Consistency-Latency Tradeoff 

We consider a key-value store system which provides a read/write API over an asynchronous
distributed message-passing network. The system consists of clients and servers, in which servers are
responsible for replicating the data (or read/write object) and ensuring the specified consistency
requirements, and clients can invoke a write (or read) operation that stores (or retrieves) some value
of the specified key by contacting server(s). Specifically, in the system, data can be propagated
from a writer client to multiple servers by a replication mechanism or background mechanism such
as read repair [2T], and the data stored at servers can later be read by clients. There may be
multiple versions of the data corresponding to the same key, and the exact value to be read by reader
clients depends on how the system ensures the consistency requirements. Note that as addressed
earlier, we define consistency based on freshness of the value returned by read operations (defined
below). We first present our probabilistic models for soft partition, latency and consistency. Then
we present our impossibility results. These results only hold for a single data-center. In a later
section, we deal with the multiple data-center case.
2.1 Models


To capture consistency, we define a new notion called t-freshness, which is a form of eventual
consistency. Consider a single key (or read/write object) being read and written concurrently by
multiple clients. An operation O (read or write) has a start time $T_{start}(O)$ when the client issues O,
and a finish time $T_{finish}(O)$ when the client receives an answer (for a read) or an acknowledgment
(for a write). The write operation ends when the client receives an acknowledgment from the server.
The value of a write operation can be reflected on the server side (i.e., visible to other clients) any
time after the write starts. For clarity of our presentation, we assume in this paper that all write
operations end, which is reasonable given client retries. Note that the written value can still
propagate to other servers after the write ends by the background mechanism. We assume that at
time 0 (initial time), the key has a default value.

Definition 1 t-freshness and t-staleness: A read operation R is said to be t-fresh if and only
if R returns a value written by any write operation that starts at or after time $T_{fresh}(R, t)$, which
is defined below:

1. If there is at least one write starting in the interval $[T_{start}(R) - t, T_{start}(R)]$, then
$T_{fresh}(R, t) = T_{start}(R) - t$.

2. If there is no write starting in the interval $[T_{start}(R) - t, T_{start}(R)]$, then there are two cases:

(a) No write starts before R starts: then $T_{fresh}(R, t) = 0$.

(b) Some write starts before R starts: then $T_{fresh}(R, t)$ is the start time of the last write
operation that starts before $T_{start}(R) - t$.

A read that is not t-fresh is said to be t-stale.

Note that the above characterization of $T_{fresh}(R, t)$ only depends on the start times of operations.
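
Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, not part of the PCAP implementation: the names `t_fresh` and `is_t_fresh` are hypothetical, the left endpoint of the interval is assumed to be closed, and the default value is modeled as if it were written at time 0.

```python
from typing import Sequence


def t_fresh(read_start: float, write_starts: Sequence[float], t: float) -> float:
    """Return T_fresh(R, t) using only operation start times."""
    lo = read_start - t
    # Case 1: at least one write starts inside [T_start(R) - t, T_start(R)].
    if any(lo <= w <= read_start for w in write_starts):
        return lo
    # Case 2: no write starts in that interval.
    earlier = [w for w in write_starts if w < lo]
    if not earlier:
        return 0.0       # (a) no write starts before R starts
    return max(earlier)  # (b) last write starting before T_start(R) - t


def is_t_fresh(read_start: float, returned_write_start: float,
               write_starts: Sequence[float], t: float) -> bool:
    """A read is t-fresh iff the write it returns started at or after T_fresh(R, t).

    Assumption: the default value is treated as a write that started at time 0.
    """
    return returned_write_start >= t_fresh(read_start, write_starts, t)
```

Note that a read returning the value of a write issued after the read itself started is still classified as t-fresh, as the definition allows.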
Figure 1: Examples illustrating Definition 1. Only start times of each operation are shown.

Fig. 1 shows three examples for t-freshness. The figure shows the times at which several read
and write operations are issued (the times when operations complete are not shown in the figure).
W(x) in the figure denotes a write operation with a value x. Note that our definition of t-freshness
allows a read to return a value that is written by a write issued after the read is issued. In
Fig. 1(i), $T_{fresh}(R, t) = T_{start}(R) - t = t' - t$; therefore, R is t-fresh if it returns 2, 3 or 4. In
Fig. 1(ii), $T_{fresh}(R, t) = T_{start}(W(1))$; therefore, R is t-fresh if it returns 1, 4 or 5. In Fig. 1(iii),
$T_{fresh}(R, t) = 0$; therefore, R is t-fresh if it returns 4, 5 or the default.
Definition 2 Probabilistic Consistency: A key-value store satisfies $(t_c, p_{ic})$-consistency^2 if in
any execution of the system, the fraction of read operations satisfying $t_c$-freshness is at least $(1 - p_{ic})$.

Intuitively, $p_{ic}$ is the likelihood of returning stale data, given the time-based freshness requirement $t_c$.


Two similar definitions have been proposed previously: (1) t-visibility from the Probabilistically
Bounded Staleness (PBS) work [7], and (2) Δ-atomicity [31]. These two metrics do not require
a read to return the latest write, but provide a time bound on the staleness of the data returned
by the read. The main difference between t-freshness and these is that we consider the start time
of write operations rather than the end time. This allows us to characterize the consistency-latency
tradeoff more precisely. While we prefer t-freshness, our PCAP system (Section 3) is modular and
could instead use t-visibility or Δ-atomicity for estimating data freshness.

As noted earlier, our focus is not comparing different consistency models, nor achieving
linearizability. We are interested in the un-achievable envelope of soft partition, latency requirements,
and consistency requirements. Traditional consistency models like linearizability can be achieved 
by delaying the effect of a write. On the contrary, the achievability of t-freshness closely ties to 
the latency of read operations and underlying network behavior as discussed later. In other words, 
t-freshness by itself is not a complete definition. 

2.1.1 Use case for t-freshness

Consider a bidding application (e.g., eBay), where everyone can post a bid, and we want every 
other participant to see posted bids as fast as possible. Assume that User 1 submits a bid, which 
is implemented as a write request (Figure 2). User 2 requests to read the bid before the bid write
process finishes. The same User 2 then waits a finite amount of time after the bid write completes
and submits another read request. Both of these read operations must reflect User 1’s bid, whereas
t-visibility only reflects the write in User 2’s second read (with suitable choice of t). The bid write
request duration can include time to send back an acknowledgment to the client, even after the bid 
has committed (on the servers). A client may not want to wait that long to see a submitted bid. 
This is especially true when the auction is near the end. 

Figure 2: Example motivating use of Definition 1. The timeline shows User 1’s write (bid), User 2’s first read, and User 2’s second read.

We define our probabilistic notion of latency as follows:

Definition 3 t-latency: A read operation R is said to satisfy t-latency if and only if it completes
within t time units of its start time.

^2 The subscripts c and ic stand for consistency and inconsistency, respectively.

Definition 4 Probabilistic Latency: A key-value store satisfies $(t_a, p_{ua})$-latency^3 if in any execution
of the system, the fraction of $t_a$-latency read operations is at least $(1 - p_{ua})$.

Intuitively, given response time requirement $t_a$, $p_{ua}$ is the likelihood of a read violating the $t_a$ bound.

Finally, we capture the concept of a soft partition of the network by defining a probabilistic 
partition model. In this section, we assume that the partition model for the network does not 
change over time. (Later, our implementation and experiments will measure the effect
of time-varying partition models.) 

In a key-value store, data can propagate from one client to another via the other servers using 
different approaches. For instance, in Apache Cassandra [39], a write might go from a writer client 
to a coordinator server to a replica server, or from a replica server to another replica server in the 
form of read repair [2T]. Our partition model captures the delay of all such propagation approaches.

Definition 5 Probabilistic Partition:

An execution is said to suffer $(t_p, \alpha)$-partition if the fraction f of paths from one client to
another client, via a server, which have latency higher than $t_p$, is such that $f \geq \alpha$.
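
A short sketch of how the partition definition could be checked against measured path latencies follows. The helper name `suffers_partition` is hypothetical, and the comparison is taken as f ≥ α, an assumption chosen to match the "at least n · α" count used in the proof of Theorem 2.

```python
from typing import Sequence


def suffers_partition(path_latencies: Sequence[float], t_p: float, alpha: float) -> bool:
    """True if the fraction f of client-to-client paths slower than t_p is at least alpha."""
    if not path_latencies:
        return False
    f = sum(1 for d in path_latencies if d > t_p) / len(path_latencies)
    return f >= alpha
```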

Our delay model loosely describes the message delay caused by any underlying network behavior 
without relying on the assumptions on the implementation of the key-value store. We do not assume 
eventual delivery of messages. We neither define propagation delay for each message nor specify 
the propagation paths (or alternatively, the replication mechanisms). This is because we want to 
have general lower bounds that apply to all systems that satisfy our models. 

2.2 Impossibility Results 

We now present two theorems that characterize the consistency-latency tradeoff in terms of our 
probabilistic models. 

First, we consider the case when the client has tight expectations, i.e., the client expects all 
data to be fresh within a time bound, and all reads need to be answered within a time bound. 

Theorem 1 If $t_c + t_a < t_p$, then it is impossible to implement a read/write data object under a
$(t_p, 0)$-partition while achieving $(t_c, 0)$-consistency, and $(t_a, 0)$-latency, i.e., there exists an execution
such that these three properties cannot be satisfied simultaneously.

Proof: The proof is by contradiction. In a system that satisfies all three properties in all
executions, consider an execution with only two clients, a writer client and a reader client. There are
two operations: (i) the writer client issues a write W, and (ii) the reader client issues a read R at
time $T_{start}(R) = T_{start}(W) + t_c$. Due to $(t_c, 0)$-consistency, the read R must return the value from W.

Let the delay of the write request W be exactly $t_p$ units of time (this obeys $(t_p, 0)$-partition).
Thus, the earliest time that W’s value can arrive at the reader client is $T_{start}(W) + t_p$. However, to
satisfy $(t_a, 0)$-latency, the reader client must receive an answer by time $T_{start}(R) + t_a = T_{start}(W) + t_c + t_a$.
This time is earlier than $T_{start}(W) + t_p$ because $t_c + t_a < t_p$. Hence, the value
returned by R cannot satisfy $(t_c, 0)$-consistency. This is a contradiction. □

^3 The subscripts a and ua stand for availability and unavailability, respectively.

Essentially, the above theorem relates the clients’ expectations of freshness ($t_c$) and latency
($t_a$) to the propagation delays ($t_p$). If client expectations are too stringent when the maximum
propagation delay is large, then it may not be possible to guarantee both consistency and latency
expectations.
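
As a worked instance of the timing argument in Theorem 1, take the illustrative values $t_c = 5$ ms, $t_a = 3$ ms, and $t_p = 10$ ms. These numbers are assumptions chosen only for the example. The read's deadline then falls before the earliest moment the written value can reach the reader:

```latex
\begin{align*}
T_{start}(R) + t_a &= T_{start}(W) + t_c + t_a \\
                   &= T_{start}(W) + 8\ \text{ms} \\
                   &< T_{start}(W) + 10\ \text{ms} = T_{start}(W) + t_p .
\end{align*}
```

With these values the read must answer 2 ms before W's value can arrive, so it is either late or stale.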

However, if we allow a fraction of the reads to return late (i.e., after $t_a$), or return $t_c$-stale values
(i.e., when either $p_{ic}$ or $p_{ua}$ is non-zero), then it may be possible to satisfy the three properties
together even if $t_c + t_a < t_p$. Hence, we consider non-zero $p_{ic}$, $p_{ua}$ and $\alpha$ in our second theorem.

Theorem 2 If $t_c + t_a < t_p$, and $p_{ua} + p_{ic} < \alpha$, then it is impossible to implement a read/write
data object under a $(t_p, \alpha)$-partition while achieving $(t_c, p_{ic})$-consistency, and $(t_a, p_{ua})$-latency, i.e.,
there exists an execution such that these three properties cannot be satisfied simultaneously.

Proof: The proof is by contradiction. In a system that satisfies all three properties in all
executions, consider an execution with only two clients, a writer client and a reader client. The execution
contains alternating pairs of write and read operations $W_1, R_1, W_2, R_2, \ldots, W_n, R_n$, such that:

1. Write $W_i$ starts at time $(t_c + t_a) \cdot (i - 1)$,

2. Read $R_i$ starts at time $(t_c + t_a) \cdot (i - 1) + t_c$, and

3. Each write $W_i$ writes a distinct value $v_i$.

By our definition of $(t_p, \alpha)$-partition, there are at least $n \cdot \alpha$ written values $v_j$ that have
propagation delay $> t_p$. By a similar argument as in the proof of Theorem 1, each of their corresponding
reads $R_j$ is such that $R_j$ cannot both satisfy $t_c$-freshness and also return within $t_a$. That is, $R_j$
is either $t_c$-stale or returns later than $t_a$ after its start time. There are $n \cdot \alpha$ such reads $R_j$; let us
call these “bad” reads.

By definition, the set of reads S that are $t_c$-stale, and the set of reads A that return after $t_a$ are
such that $|S| \leq n \cdot p_{ic}$ and $|A| \leq n \cdot p_{ua}$. Put together, these imply:

$n \cdot \alpha \leq |S \cup A| \leq |S| + |A| \leq n \cdot p_{ic} + n \cdot p_{ua}$.

The first inequality arises because all the “bad” reads are in $S \cup A$. But this inequality implies
that $\alpha \leq p_{ua} + p_{ic}$, which violates our assumptions. □
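
The premise of Theorem 2 is easy to evaluate for a concrete parameter choice. The sketch below uses a hypothetical helper name, `in_unachievable_region`. A `False` result only means that Theorem 2 does not rule the combination out; it does not show that the combination is achievable.

```python
def in_unachievable_region(t_c: float, t_a: float, t_p: float,
                           p_ic: float, p_ua: float, alpha: float) -> bool:
    """True when Theorem 2's premise holds: t_c + t_a < t_p and p_ua + p_ic < alpha."""
    return (t_c + t_a < t_p) and (p_ua + p_ic < alpha)
```

For example, with $t_c = 5$ ms, $t_a = 3$ ms, $t_p = 10$ ms, and $\alpha = 0.2$ (illustrative values), the combination $p_{ic} = p_{ua} = 0.05$ falls inside the region covered by the theorem, while $p_{ic} = p_{ua} = 0.1$ does not.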


3 PCAP Key-value Stores 

Having formally specified the (un)achievable envelope of consistency-latency (Theorem 2), we now
turn our attention to designing systems that achieve performance close to this theoretical envelope.
We also convert our probabilistic models for consistency and latency from Section 2.1 into SLAs,
and show how to design adaptive key-value stores that satisfy such probabilistic SLAs inside a
single data-center. We call such systems PCAP systems. So PCAP systems (1) can achieve
performance close to the theoretical consistency-latency tradeoff envelope, and (2) can adapt to
meet probabilistic consistency and latency SLAs inside a single data-center. Our PCAP systems
can also alternatively be used with SLAs from PBS [7] or Pileus [58] [3].
Assumptions about underlying Key-value Store: PCAP systems can be built on top of
existing key-value stores. We make a few assumptions about such key-value stores. First, we
assume that each key is replicated on multiple servers. Second, we assume the existence of a
“coordinator” server that acts as a client proxy in the system, finds the replica locations for a key
(e.g., using consistent hashing [56]), forwards client queries to replicas, and finally relays replica
responses to clients. Most key-value stores feature such a coordinator [32]. Third, we assume
the existence of a background mechanism such as read repair [2T] for reconciling divergent replicas.
Finally, we assume that the clocks on each server in the system are synchronized using a protocol
like NTP so that we can use global timestamps to detect stale data (most key-value stores running
within a datacenter already require this assumption, e.g., to decide which updates are fresher). It
should be noted that our impossibility results in Section 2.2 do not depend on the accuracy of the
clock synchronization protocol. However, the sensitivity of the protocol affects the ability of PCAP
systems to adapt to network delays. For example, if the servers are synchronized to within 1 ms
using NTP, then the PCAP system cannot react to network delays lower than 1 ms.

SLAs: We consider two scenarios, where the SLA specifies either: i) a probabilistic latency
requirement, or ii) a probabilistic consistency requirement. In the former case, our adaptive system
optimizes the probabilistic consistency while meeting the SLA requirement, whereas in the latter it
optimizes probabilistic latency while meeting the SLA. These SLAs are probabilistic, in the sense
that they give statistical guarantees to operations over a long duration.

A latency SLA (i) looks as follows:

Given: Latency SLA = $\langle p_{ua}^{sla}, t_a^{sla}, t_c^{sla} \rangle$;

Ensure that: The fraction $p_{ua}$ of reads, whose finish and start times differ by more than $t_a^{sla}$,
is such that: $p_{ua}$ stays below $p_{ua}^{sla}$;

Minimize: The fraction $p_{ic}$ of reads which do not satisfy $t_c^{sla}$-freshness.

This SLA is similar to latency SLAs used in industry today. As an example, consider a shopping 
cart application [58] where the client requires that at most 10% of the operations take longer than
300 ms, but wishes to minimize staleness. Such an application prefers latency over consistency. In
our system, this requirement can be specified as the following PCAP latency SLA:

$\langle p_{ua}^{sla}, t_a^{sla}, t_c^{sla} \rangle = \langle 0.1, 300\ \text{ms}, 0\ \text{ms} \rangle$.

A consistency SLA (ii) looks as follows:

Given: Consistency SLA = $\langle p_{ic}^{sla}, t_c^{sla}, t_a^{sla} \rangle$;

Ensure that: The fraction $p_{ic}$ of reads that do not satisfy $t_c^{sla}$-freshness is such that: $p_{ic}$
stays below $p_{ic}^{sla}$;

Minimize: The fraction $p_{ua}$ of reads whose finish and start times differ by more than $t_a^{sla}$.

Note that as mentioned earlier, consistency is defined based on freshness of the value returned 
by read operations. As an example, consider a web search application that wants to ensure no more 
than 10% of search results return data that is over 500 ms old, but wishes to minimize the fraction 
of operations taking longer than 100 ms [58]. Such an application prefers consistency over latency. 
This requirement can be specified as the following PCAP consistency SLA:

$\langle p_{ic}^{sla}, t_c^{sla}, t_a^{sla} \rangle = \langle 0.10, 500\ \text{ms}, 100\ \text{ms} \rangle$.
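
The two SLA forms can be represented as small records, sketched below. The class and field names are hypothetical and are not taken from the PCAP implementation. "Stays below" is read here as "at or below", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLA:
    p_ua_sla: float    # bound on the fraction of reads slower than t_a_sla_ms
    t_a_sla_ms: float  # latency bound for a read
    t_c_sla_ms: float  # freshness window whose violations are minimized


@dataclass(frozen=True)
class ConsistencySLA:
    p_ic_sla: float    # bound on the fraction of reads that are not t_c-fresh
    t_c_sla_ms: float  # freshness window
    t_a_sla_ms: float  # latency bound whose violations are minimized


def meets_latency_sla(sla: LatencySLA, p_ua: float) -> bool:
    return p_ua <= sla.p_ua_sla


def meets_consistency_sla(sla: ConsistencySLA, p_ic: float) -> bool:
    return p_ic <= sla.p_ic_sla


# The two examples from the text: shopping cart and web search.
shopping_cart = LatencySLA(p_ua_sla=0.1, t_a_sla_ms=300, t_c_sla_ms=0)
web_search = ConsistencySLA(p_ic_sla=0.10, t_c_sla_ms=500, t_a_sla_ms=100)
```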
Figure 3: Effect of Various Control Knobs.


Our PCAP system can leverage three control knobs to meet these SLAs: 1) read delay, 2) read 
repair rate, and 3) consistency level. The last two of these are present in most key-value stores. 
The first (read delay) has been discussed in previous literature [S7l 171 [551 [^ . 

3.1 Control Knobs 

Fig. 3 shows the effect of our three control knobs on latency and consistency. We discuss each of
these knobs and explain the entries in the table.

The knobs of Fig. 3 are all directly or indirectly applicable to the read path in the key-value
store. As an example, the knobs pertaining to the Cassandra query path are shown in Fig. 4, which
shows the major steps involved in answering a read query from a front-end to the key-value
store cluster: (1) Client sends a read query for a key to a coordinator server in the key-value store
cluster; (2) Coordinator forwards the query to one or more replicas holding the key; (3) Response
is sent from replica(s) to coordinator; (4) Coordinator forwards response with highest timestamp
to client; (5) Coordinator does read repair by updating replicas that had returned older values,
by sending them the freshest timestamp value for the key. Step (5) is usually performed in the
background.



Figure 4: Cassandra Read Path and PCAP Control Knobs. 

A read delay involves the coordinator artificially delaying the read query for a specified duration 
of time before forwarding it to the replicas, i.e., between step (1) and step (2). This gives the system 
some time to converge after previous writes. Increasing the value of read delay improves consistency
(lowers $p_{ic}$) and degrades latency (increases $p_{ua}$). Decreasing read delay achieves the reverse. Read
delay is an attractive knob because: 1) it does not interfere with client specified parameters (e.g.,
consistency level in Cassandra [H]), and 2) it can take any non-negative continuous value instead
of only discrete values allowed by consistency levels. Our PCAP system inserts read delays only
when it is needed to satisfy the specified SLA.

However, read delay cannot be negative, as one cannot speed up a query and send it back in
time. This brings us to our second knob: read repair rate. Read repair was depicted as distinct
step (5) in our outline of Fig. 4, and is typically performed in the background. The coordinator
maintains a buffer of recent reads where some of the replicas returned older values along with 
the associated freshest value. It periodically picks an element from this buffer and updates the 
appropriate replicas. In key-value stores like Apache Cassandra and Riak, read repair rate is an 
accessible configuration parameter per column family. 

Our read repair rate knob is the probability with which a given read that returned stale replica 
values will be added to the read repair buffer. Thus, a read repair rate of 0 implies no read 
repair, and replicas will be updated only by subsequent writes. Read repair rate = 0.1 means the 
coordinator performs read repair for 10% of the read requests. 
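
The read repair rate knob described above amounts to a biased coin flip per read that observed stale replicas. The sketch below is illustrative only: `maybe_enqueue_repair` and the tuple layout of buffer entries are hypothetical and do not correspond to Cassandra's or Riak's internal APIs.

```python
import random
from typing import List, Sequence, Tuple


def maybe_enqueue_repair(buffer: List[Tuple[str, str, tuple]], key: str,
                         freshest_value: str, stale_replicas: Sequence[str],
                         repair_rate: float, rng=random) -> bool:
    """Add a repair task with probability repair_rate if some replica returned an older value."""
    # rng.random() is in [0, 1), so a rate of 0 never repairs and a rate of 1 always does.
    if stale_replicas and rng.random() < repair_rate:
        buffer.append((key, freshest_value, tuple(stale_replicas)))
        return True
    return False
```

A background task would then periodically pop entries from the buffer and push the freshest value to the listed replicas.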

Increasing (respectively, decreasing) the read repair rate can improve (respectively degrade)
consistency. Since the read repair rate does not directly affect the read path (Step (5), described
earlier, is performed in the background), it does not affect latency. Fig. 3 summarizes this
behavior.^4

The third potential control knob is consistency level. Some key-value stores allow the client to
specify, along with each read or write operation, how many replicas the coordinator should wait
for (in step (3) of Fig. 4) before it sends the reply back in step (4). For instance, Cassandra offers
consistency levels: ONE, TWO, QUORUM, ALL. As one increases consistency level from ONE to ALL,
reads are delayed longer (latency increases) while the possibility of returning the latest write rises
(consistency increases).

Our PCAP system relies primarily on read delay and repair rate as the control knobs.
Consistency level can be used as a control knob only for applications in which user expectations will
not be violated, e.g., when reads do not specify a specific discrete consistency level. That is, if a
read specifies a higher consistency level, it would be prohibitive for the PCAP system to degrade
the consistency level as this may violate client expectations. Techniques like continuous partial
quorums (CPQ) [50], and adaptive hybrid quorums [19] fall in this category, and thus interfere
with application/client expectations. Further, read delay and repair rate are non-blocking control
knobs under replica failure, whereas consistency level is blocking. For example, if a Cassandra client 
sets consistency level to QUORUM with replication factor 3, then the coordinator will be blocked if 
two of the key’s replicas are on failed nodes. On the other hand, under replica failures read repair 
rate does not affect operation latency, while read delay only delays reads by a maximum amount. 
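
The blocking behavior of the consistency level knob can be made concrete with a small sketch. The helper names are hypothetical; the QUORUM size of floor(RF/2) + 1 is the standard Cassandra definition.

```python
def replicas_to_wait_for(level: str, replication_factor: int) -> int:
    """Number of replica responses the coordinator waits for at a given consistency level."""
    required = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return required[level]


def read_can_complete(level: str, replication_factor: int, live_replicas: int) -> bool:
    """A read blocks when fewer replicas are alive than the level requires."""
    return live_replicas >= replicas_to_wait_for(level, replication_factor)
```

With replication factor 3, QUORUM needs 2 responses, so a read blocks when only one replica is alive, which is the failure case described in the text.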

3.2 Selecting A Control Knob 

As the primary control knob, the PCAP system prefers read delay over read repair rate. This 
is because the former allows tuning both consistency and latency, while the latter affects only 
consistency. The only exception occurs when during the PCAP system adaptation process, a state 
is reached where consistency needs to be degraded (e.g., increase pic to be closer to the SLA) but 
the read delay value is already zero. Since read delay cannot be lowered further, in this instance 
the PCAP system switches to using the secondary knob of read repair rate, and starts decreasing 
this instead. 
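
The selection rule in this section can be summarized as a tiny decision function. This is a sketch of the rule as stated in the text, with hypothetical names; it is not the authors' code.

```python
def select_knob(need_to_degrade_consistency: bool, read_delay_ms: float) -> str:
    """Prefer read delay; fall back to read repair rate only when the delay is already zero
    and consistency still has to be degraded (p_ic raised toward the SLA)."""
    if need_to_degrade_consistency and read_delay_ms <= 0:
        return "read_repair_rate"
    return "read_delay"
```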

Another reason why read repair rate is not a good choice for the primary knob is that it takes
longer to estimate $p_{ic}$ than for read delay. Because read repair rate is a probability, the system
needs a larger number of samples (from the operation log) to accurately estimate the actual $p_{ic}$
resulting from a given read repair rate. For example, in our experiments, we observe that the
system needs to inject k > 3000 operations to obtain an accurate estimate of $p_{ic}$, whereas only
k = 100 suffices for the read delay knob.

^4 Although read repair rate does not affect latency directly, it introduces some background traffic and can
impact propagation delay. While our model ignores such small impacts, our experiments reflect the net effect of the
background traffic.

 1: procedure CONTROL(SLA = $\langle p_{ic}^{sla}, t_c^{sla}, t_a^{sla} \rangle$, $\epsilon$)
 2:     $p_{ic}^{sla'} := p_{ic}^{sla} - \epsilon$;
 3:     Select control_knob;                 // (Sections 3.1, 3.2)
 4:     inc := 1;
 5:     dir := +1;
 6:     while (true) do
 7:         Inject k new operations (reads and writes)
 8:             into store;
 9:         Collect log $\mathcal{L}$ of recent completed reads
10:             and writes (values, start and finish times);
11:         Use $\mathcal{L}$ to calculate
12:             $p_{ic}$ and $p_{ua}$;                 // (Section 3.4)
13:         new_dir := ($p_{ic} > p_{ic}^{sla'}$) ? +1 : -1;
14:         if new_dir = dir then
15:             inc := inc * 2;              // Multiplicative increase
16:             if inc > MAX_INC then
17:                 inc := MAX_INC;
18:             end if
19:         else
20:             inc := 1;                    // Reset to unit step
21:             dir := new_dir;              // Change direction
22:         end if
23:         control_knob := control_knob + inc * dir;
24:     end while
25: end procedure

Figure 5: Adaptive Control Loop for Consistency SLA.

3.3 PCAP Control Loop 

The PCAP control loop adaptively tunes control knobs to always meet the SLA under continuously 
changing network conditions. The control loop for a consistency SLA is depicted in Fig. 5. The control 
loop for a latency SLA is analogous and is not shown. 

This control loop runs at a standalone server called the PCAP Coordinator. This server runs 
an infinite loop. In each iteration, the coordinator: i) injects k operations into the store (line 
7), ii) collects the log L for the k recent operations in the system (line 9), iii) calculates pua and pic 
(Section 3.4) from L (line 11), and iv) uses these to change the knob (lines 13-23). 

The behavior of the control loop in Fig. 5 is such that the system will converge to “around” the 
specified SLA. Because our original latency (consistency) SLAs require pua (pic) to stay below the 
SLA, we introduce a laxity parameter ε, subtract ε from the target SLA, and treat this as the target 

Footnote: The PCAP Coordinator is a special server, and is different from Cassandra’s use of a coordinator for clients to 
send reads and writes. 
SLA in the control loop. Concretely, given a target consistency SLA $\langle p_{ic}^{sla}, t_c^{sla}, t_a^{sla} \rangle$, where the 
goal is to control the fraction of stale reads to be under $p_{ic}^{sla}$, we control the system such that $p_{ic}$ 
quickly converges around $p_{ic}^{sla'} = p_{ic}^{sla} - \epsilon$, and thus stays below $p_{ic}^{sla}$. Small values of ε suffice to 
guarantee convergence (for instance, our experiments use ε < 0.05). 

We found that the naive approach of changing the control knob by the smallest unit increment 
(e.g., always 1 ms changes in read delay) resulted in a long convergence time. Thus, we opted for 
a multiplicative approach (Fig. 5, lines 13-23) to ensure quick convergence. 

We explain the control loop via an example. For concreteness, suppose only the read delay 
knob (Section 3.1) is active in the system, and that the system has a consistency SLA. Suppose pic 
is higher than $p_{ic}^{sla'}$. The multiplicative-change strategy starts incrementing the read delay, initially 
starting with a unit step size (line 4). This step size is exponentially increased from one iteration 
to the next, thus multiplicatively increasing read delay (line 15). This continues until the measured 
pic goes just under $p_{ic}^{sla'}$. At this point, the new_dir variable changes sign (line 13), so the strategy 
reverses direction, and the step is reset to unit size (lines 20-21). In subsequent iterations, the 
read delay starts decreasing by the step size. Again, the step size is increased exponentially until 
pic just goes above $p_{ic}^{sla'}$. Then its direction is reversed again, and this process continues similarly 
thereafter. Notice that (lines 13-15) from one iteration to the next, as long as pic continues to 
remain above (or below) $p_{ic}^{sla'}$, we have that: i) the direction of movement does not change, and 
ii) exponential increase continues. At steady state, the control loop keeps changing direction with 
a unit step size (bounded oscillation), and the metric stays converged under the SLA. Although 
advanced techniques such as time dampening can further reduce oscillations, we decided to avoid 
them to minimize control loop tuning overheads. Later, in Section 4.5, we use control-theoretic 
techniques for the control loop in geo-distributed settings to reduce excessive oscillations. 

In order to prevent large step sizes, we cap the maximum step size (lines 16-18). For our 
experiments, we do not allow read delay to exceed 10 ms, and the unit step size is set to 1 ms. 
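
The knob update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' coordinator code: the names `KnobState` and `control_step` and the value of `MAX_INC` are assumptions made for the example, and clamping the knob to the range [0, 10] ms follows the zero floor and the 10 ms ceiling mentioned in the text.

```python
from dataclasses import dataclass

MAX_INC = 8        # hypothetical cap on the step size, in unit steps of 1 ms
MAX_DELAY_MS = 10  # read delay ceiling quoted in the text


@dataclass
class KnobState:
    knob: int = 0        # current read delay in ms
    inc: int = 1         # current step size
    direction: int = +1  # +1 increases the knob, -1 decreases it


def control_step(state: KnobState, p_ic: float, p_ic_target: float) -> KnobState:
    """One pass through the knob-update part of the loop in Fig. 5."""
    # Raise the knob while inconsistency is above the target, lower it otherwise.
    new_dir = +1 if p_ic > p_ic_target else -1
    if new_dir == state.direction:
        # Same direction as last time: double the step, up to the cap.
        inc = min(state.inc * 2, MAX_INC)
    else:
        # Direction flipped: fall back to the unit step.
        inc = 1
    # Assumption: the knob is clamped to [0, MAX_DELAY_MS].
    knob = max(0, min(MAX_DELAY_MS, state.knob + inc * new_dir))
    return KnobState(knob, inc, new_dir)
```

Calling `control_step` repeatedly with a measured pic above the target grows the step size geometrically until the cap or the delay ceiling is reached; the first measurement below the target flips the direction and resets the step to one unit.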

We preferred active measurement (whereby the PCAP Coordinator injects queries) over 
passive measurement for two reasons: i) the active approach gives the PCAP Coordinator better control over 
convergence, so the convergence rate is more uniform over time, and ii) in the passive approach, if the 
client operation rate were to become low, then either the PCAP Coordinator would need to inject 
more queries, or convergence would slow down. Nevertheless, in Section 5.4.7, we show results 
using a passive measurement approach. Exploration of hybrid active-passive approaches based on 
an operation rate threshold could be an interesting direction. 

Overall, our PCAP controller satisfies the SASO (Stability, Accuracy, low Settling time, small 
Overshoot) control objectives [55]. 


3.4 Complexity of Computing pua and pic 

We show that the computation of pua and pic (lines 11-12, Fig. 5) is efficient. Suppose there are r 
reads and w writes in the log, thus log size k = r + w. Calculating pua makes a linear pass over 
the read operations, and compares the difference of their finish and start times with ta. This takes 
O(r) = O(k). 

pic is calculated as follows. We first extract and sort all the writes according to start timestamp, 
inserting each write into a hash table under key <object value, write key, write timestamp>. In a 
second pass over the read operations, we extract each read's matching write by using the hash table key (the 
third entry of the hash key is the same as the read’s returned value timestamp). We also extract 
neighboring writes of this matching write in constant time (due to the sorting), and thus calculate 
tc-freshness for each read. The first pass takes time O(r + w + w log w), while the second pass takes 
O(r + w). The total time complexity to calculate pic is thus O(r + w + w log w) = O(k log k). 
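
A rough Python sketch of the two computations follows. The record types `Read` and `Write` and the function names are hypothetical. The staleness test is a simplified stand-in for the paper's t-freshness definition (which is not restated in this section): a read is counted as stale when the next write to the same key, after the write whose value it returned, started more than tc before the read started. The sketch also assumes that every read returns the value of a logged write and that write start timestamps are unique per key.

```python
from collections import defaultdict
from typing import NamedTuple


class Write(NamedTuple):
    key: str
    start: float


class Read(NamedTuple):
    key: str
    start: float
    finish: float
    value_ts: float  # start timestamp of the write whose value was returned


def compute_p_ua(reads, t_a):
    """Fraction of reads whose latency exceeds t_a: one linear pass, O(r)."""
    late = sum(1 for r in reads if r.finish - r.start > t_a)
    return late / len(reads)


def compute_p_ic(reads, writes, t_c):
    """Fraction of stale reads under the simplified rule stated above."""
    # First pass: sort writes by start time, O(w log w), and group them per key.
    per_key = defaultdict(list)
    for w in sorted(writes, key=lambda w: w.start):
        per_key[w.key].append(w.start)
    # Hash table from (key, write timestamp) to position in the sorted list.
    index = {(k, ts): i for k, starts in per_key.items() for i, ts in enumerate(starts)}
    # Second pass: constant-time lookup of the matching write and its neighbor.
    stale = 0
    for r in reads:
        starts = per_key[r.key]
        i = index[(r.key, r.value_ts)]
        if i + 1 < len(starts) and starts[i + 1] < r.start - t_c:
            stale += 1
    return stale / len(reads)
```

The sort dominates the cost, which matches the O(k log k) bound derived above.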


4 PCAP for Geo-distributed Settings 

In this section we extend our PCAP system from a single data-center to multiple geo-distributed 
data-centers. We call this system GeoPCAP. 

4.1 System Model 

Assume there are n data-centers. Each data-center stores multiple replicas for each data-item. 
When a client application submits a query, the query is first forwarded to the data-center closest 
to the client. We call this data-center the local data-center for the client. If the local data-center 
stores a replica of the queried data item, that replica might not have the latest value, since write 
operations at other data-centers could have updated the data item. Thus in our system model, the 
local data-center contacts one or more remote data-centers to retrieve (possibly) fresher 
values for the data item. 

4.2 Probabilistic Composition Rules 

Each data-center runs our PCAP-enabled key-value store. Each such PCAP instance defines 
per data-center probabilistic latency and consistency models (Section 2). To obtain the global 
behavior, we need to compose these probabilistic consistency and latency/availability models across 
different data-centers. This is done by our composition rules. 

The composition rules for merging independent latency/consistency models from data-centers 
check whether the SLAs are met by the composed system. Since single data-center PCAP systems 
define probabilistic latency and consistency models, our composition rules are also probabilistic in 
nature. In practice, however, our composition rules do not require all data-centers to run 
PCAP-enabled key-value stores. As long as we can measure consistency and latency at each 
data-center, we can estimate the probabilistic models of consistency/latency at each data-center 
and use our composition rules to merge them. 

We consider two types of composition rules: (1) QUICKEST (Q), where at least one data-center 
(e.g., the local or closest remote data-center) satisfies client-specified latency or freshness 
(consistency) guarantees; and (2) ALL (A), where all the data-centers must satisfy latency or freshness 
guarantees. These two are, respectively, generalizations of the Apache Cassandra multi-data-center 
deployment [29] consistency levels (CL) LOCAL_QUORUM and EACH_QUORUM. 

Compared to Section 2, which analyzed the fraction of executions that satisfy a predicate (the 
proportional approach), in this section we use a simpler probabilistic approach. This is because 
although the proportional approach is more accurate, it is less tractable than the probabilistic 
model in the geo-distributed case. 

Our probabilistic composition rules fall into three categories: (1) composing consistency models; 
(2) composing latency models; and (3) composing a wide-area-network (WAN) partition model with 
a data-center (consistency or latency) model. The rules are summarized in Figure 6, and we discuss 
them next. 
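Models             Composition rule    Expression
Consistency-WAN    N.A.                $\Pr[X + Y > t_c + t_p^G] \ge p_{ic} \cdot \alpha^G$
Latency-WAN        N.A.                $\Pr[X + Y > t_a + t_p^G] \ge p_{ua} \cdot \alpha^G$

[The table rows for the latency and consistency composition rules (QUICKEST and ALL) are not recoverable here; those rules are stated in Theorems 3 and 4.]

Figure 6: GeoPCAP Composition Rules. 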


4.2.1 Composing latency models 

Assume there are n data-centers storing the replica of a key with latency models $(t_a^1, p_{ua}^1), (t_a^2, p_{ua}^2), \ldots, (t_a^n, p_{ua}^n)$. 
Let $C_A$ denote the composed system. Let $(p_{ua}^c, t_a^c)$ denote the latency model of the composed system 
$C_A$. This indicates that the fraction of reads in $C_A$ that complete within $t_a^c$ time units is at least 
$(1 - p_{ua}^c)$. This is the latency SLA expected by clients. Let $p_{ua}^c(t)$ denote the probability of missing 
a deadline of t time units in the composed model. Let $X_j$ denote the random variable measuring 
read latency in data-center j. Let $E_j(t)$ denote the event that $X_j > t$. By definition we have that 
$\Pr[E_j(t_a^j)] = p_{ua}^j$, and $\Pr[\bar{E}_j(t_a^j)] = 1 - p_{ua}^j$. Let $f_j(t)$ denote the cumulative distribution function 
(CDF) for $X_j$. So by definition, $f_j(t_a^j) = \Pr[X_j \le t_a^j] = 1 - \Pr[X_j > t_a^j]$. The following theorem 
articulates the probabilistic latency composition rules: 

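Theorem 3 Let n data-centers store the replica for a key with latency models $(t_a^1, p_{ua}^1), (t_a^2, p_{ua}^2), \ldots, (t_a^n, p_{ua}^n)$. 
Let $C_A$ denote the composed system with latency model $(p_{ua}^c, t_a^c)$. Then for composition rule QUICKEST 
we have: 

$$p_{ua}^c(\min_j t_a^j) \ge \prod_j p_{ua}^j \ge p_{ua}^c(\max_j t_a^j), \quad \text{and} \quad \min_j t_a^j \le t_a^c \le \max_j t_a^j, \quad \text{where } j \in \{1, \ldots, n\}. \tag{1}$$

For composition rule ALL, 

$$p_{ua}^c(\min_j t_a^j) \ge 1 - \prod_j (1 - p_{ua}^j) \ge p_{ua}^c(\max_j t_a^j), \quad \text{and} \quad \min_j t_a^j \le t_a^c \le \max_j t_a^j, \quad \text{where } j \in \{1, \ldots, n\}. \tag{2}$$
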


Proof: We outline the proof for composition rule QUICKEST. In QUICKEST, a latency deadline t 
is violated in the composed model when all data-centers miss the t deadline. This happens with 
probability $p_{ua}^c(t)$ (by definition). We first prove a simpler Case 1, then the general version in Case 2. 

Case 1: Consider the simple case where all $t_a^j$ values are identical, i.e., $\forall j,\ t_a^j = t_a$. Then 
$p_{ua}^c(t_a) = \Pr[\cap_j E_j(t_a)] = \prod_j \Pr[E_j(t_a)] = \prod_j p_{ua}^j$ (assuming independence across data-centers). 

Case 2: Let 

$$t_a^{\min} = \min_j t_a^j \tag{3}$$

Then, 

$$\forall j,\ t_a^j \ge t_a^{\min} \tag{4}$$

Then, by definition of the CDF, 

$$\forall j,\ f_j(t_a^j) \ge f_j(t_a^{\min}) \tag{5}$$

By definition, 

$$\forall j,\ \Pr[X_j \le t_a^{\min}] \le \Pr[X_j \le t_a^j] \tag{6}$$

$$\forall j,\ \Pr[X_j > t_a^{\min}] \ge \Pr[X_j > t_a^j] \tag{7}$$

Multiplying all, 

$$\prod_j \Pr[X_j > t_a^{\min}] \ge \prod_j \Pr[X_j > t_a^j] \tag{8}$$

But this means, 

$$p_{ua}^c(t_a^{\min}) \ge \prod_j p_{ua}^j \tag{9}$$

$$p_{ua}^c(\min_j t_a^j) \ge \prod_j p_{ua}^j \tag{10}$$

Similarly, let 

$$t_a^{\max} = \max_j t_a^j \tag{11}$$

Then, 

$$\forall j,\ t_a^{\max} \ge t_a^j \tag{12}$$

$$\forall j,\ \Pr[X_j > t_a^j] \ge \Pr[X_j > t_a^{\max}] \tag{13}$$

$$\prod_j \Pr[X_j > t_a^j] \ge \prod_j \Pr[X_j > t_a^{\max}] \tag{14}$$

$$\prod_j p_{ua}^j \ge p_{ua}^c(t_a^{\max}) \tag{15}$$

$$\prod_j p_{ua}^j \ge p_{ua}^c(\max_j t_a^j) \tag{16}$$

Finally, combining Equations 10 and 16, we get Equation 1. 


The proof for composition rule ALL follows similarly. In this case, a latency deadline t is satisfied 
when all data-centers satisfy the deadline. So a deadline miss in the composed model means at least 
one data-center misses the deadline. The derivation of the composition rule is similar and we 
invite the reader to work it out to arrive at the equations depicted in Figure 6. □ 
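
As a small numerical illustration of the two rules, the following Python sketch evaluates the middle terms of Equations 1 and 2 for the special case where all data-centers share the same deadline (Case 1 of the proof) and are independent. The function names are hypothetical.

```python
import math


def compose_quickest(p_ua_list):
    """Composed miss probability under QUICKEST: every data-center misses."""
    return math.prod(p_ua_list)


def compose_all(p_ua_list):
    """Composed miss probability under ALL: at least one data-center misses."""
    return 1.0 - math.prod(1.0 - p for p in p_ua_list)


if __name__ == "__main__":
    per_dc = [0.2, 0.5]  # illustrative per data-center p_ua values
    print(compose_quickest(per_dc), compose_all(per_dc))
```

Under these assumptions QUICKEST can never be worse than the best single data-center, while ALL can never be better than the worst one.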


4.2.2 Composing consistency models 


[Figure 7 timeline: W_start, R_start, and R_end marked along the time axis; the freshness deadline lies in the past, the latency deadline in the future.] 

Figure 7: Symmetry of freshness and latency requirements. 

t-latency and t-freshness guarantees (as defined earlier) are time-symmetric (Figure 7). 

While t-latency can be considered a deadline in the future, t-freshness can be considered a deadline 
in the past. This means that for a given read, t-freshness constrains how old a read value can be. 

So the composition rules remain the same for consistency and availability. 

Thus the consistency composition rules can be obtained by substituting pua with pic and ta 
with tc in the latency composition rules (last 4 rows of the composition-rule table). 

This leads to the following theorem for consistency composition: 

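Theorem 4 Let n data-centers store the replica for a key with consistency models $(t_c^1, p_{ic}^1), (t_c^2, p_{ic}^2), \ldots, (t_c^n, p_{ic}^n)$. 
Let $C_C$ denote the composed system with consistency model $(p_{ic}^c, t_c^c)$. Then for composition rule 
QUICKEST we have: 

$$p_{ic}^c(\min_j t_c^j) \ge \prod_j p_{ic}^j \ge p_{ic}^c(\max_j t_c^j), \quad \text{and} \quad \min_j t_c^j \le t_c^c \le \max_j t_c^j, \quad \text{where } j \in \{1, \ldots, n\}. \tag{17}$$

For composition rule ALL, 

$$p_{ic}^c(\min_j t_c^j) \ge 1 - \prod_j (1 - p_{ic}^j) \ge p_{ic}^c(\max_j t_c^j), \quad \text{and} \quad \min_j t_c^j \le t_c^c \le \max_j t_c^j, \quad \text{where } j \in \{1, \ldots, n\}. \tag{18}$$
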
4.2.3 Composing consistency/latency model with a WAN partition model 

All data-centers are connected to each other through a wide-area-network (WAN). We assume 
the WAN follows a partition model $(t_p^G, \alpha^G)$. This indicates that an $\alpha^G$ fraction of messages passing 
through the WAN suffers a delay $> t_p^G$. Note that the WAN partition model is distinct from the per 
data-center partition model. Let X denote the latency in a remote data-center, and 
Y denote the WAN latency of a link connecting the local data-center to this remote data-center 
(with latency X). Then the total latency of the path to the remote data-center is X + Y, and 

$$\Pr[X + Y > t_a + t_p^G] \ge \Pr[X > t_a] \cdot \Pr[Y > t_p^G] = p_{ua} \cdot \alpha^G \tag{19}$$

Here we assume the WAN latency and data-center latency distributions are independent. Note 
that Equation 19 gives a lower bound on the probability. In practice we can estimate the probability 
by sampling both X and Y, and measuring the fraction of samples in which (X + Y) exceeds ($t_a + t_p^G$). 
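
The sampling-based estimate mentioned above can be sketched as follows. The function name and the exponential delay distributions in the demo are assumptions made purely for illustration; the paper does not prescribe them.

```python
import random


def estimate_wan_composition(xs, ys, t_a, t_p):
    """Return (empirical Pr[X + Y > t_a + t_p], empirical p_ua * alpha).

    xs: sampled data-center latencies; ys: sampled WAN latencies (same length).
    """
    n = len(xs)
    p_ua = sum(x > t_a for x in xs) / n    # fraction of slow data-center reads
    alpha = sum(y > t_p for y in ys) / n   # fraction of slow WAN messages
    p_total = sum(x + y > t_a + t_p for x, y in zip(xs, ys)) / n
    return p_total, p_ua * alpha


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical delay distributions (means of 20 ms and 50 ms).
    xs = [rng.expovariate(1 / 20) for _ in range(10_000)]
    ys = [rng.expovariate(1 / 50) for _ in range(10_000)]
    print(estimate_wan_composition(xs, ys, t_a=30, t_p=80))
```

The first returned value is the direct estimate of the composed probability; the second is the product that appears on the right-hand side of the WAN composition bound.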

4.3 Example 

The example in Figure 8 shows the composition rules in action. In this example, there is one local 
data-center and 2 replica data-centers. Each data-center can hold multiple replicas of a data-item. 
First we compose each replica data-center latency model with the WAN partition model. Second 
we take the WAN-latency composed models for each data-center and compose them using the 
QUICKEST rule (Figure 8, bottom part). 

4.4 GeoPCAP Control Knob 

We use a similar delay knob to meet the SLAs in a geo-distributed setting. We call this the 
geo-delay knob and denote it as Δ. The time delay Δ is the delay added at the local data-center 
to a read request received from a client before it is forwarded to the replica data-centers. Δ 
affects the consistency-latency trade-off in a manner similar to the read delay knob in a data-center 
(Section 3.1). Increasing the knob tightens the deadline at each replica data-center, thus increasing 
per data-center latency (pua). Similar to read delay, increasing the geo-delay knob 
improves consistency, since it gives each data-center time to commit the latest writes. 


4.5 GeoPCAP Control Loop 


Our GeoPCAP system uses a control loop depicted in Figure 9 for the Consistency SLA case using 
the QUICKEST composition rule. The control loops for the other three combinations 
(Consistency-ALL, Latency-ALL, Latency-QUICKEST) are similar. 

Initially, we opted to use the single data-center multiplicative control loop (Section 3.3) for 
GeoPCAP. However, the multiplicative approach led to increased oscillations for the composed 
consistency (pic) and latency (pua) metrics in a geo-distributed setting. The multiplicative approach 
sufficed for the single data-center PCAP system, since the oscillations were bounded in 
steady state. However, the increased oscillations in a geo-distributed setting prompted us to use a 
control-theoretic approach for GeoPCAP. 

As a result, we use a PID control theory approach [3] for the GeoPCAP controller. The controller 
runs an infinite loop, so that it can react to network delay changes and meet SLAs. There is a 


Footnote: We ignore the latency of the local data-center in the WAN composition rule (Section 4.2.3), since the local 
data-center latency is used in the latency composition rule (Section 4.2.1). 
procedure CONTROL(SLA = <p_ic^sla, t_c^sla, t_a^sla>)
  Geo-delay Δ := 0
  E := 0, Error_old := 0
  set k_p, k_d, k_i for PID control (tuning)
  Let (t_p^G, α^G) be the WAN partition model
  while (true) do
    for each data-center i do
      Let F_i denote the random freshness interval at i
      Let L_i denote the random operation latency at i
      Let W_i denote the WAN latency of the link to i
      // WAN composition (Section 4.2.3)
      Estimate p_ic^i := Pr[F_i + W_i > t_c^sla + t_p^G + Δ]
      Estimate p_ua^i := Pr[L_i + W_i > t_a^sla + t_p^G − Δ]
    end for
    // Consistency/Latency composition (Sections 4.2.1, 4.2.2)
    p_ic^c := Π_i p_ic^i, p_ua^c := Π_i p_ua^i
    Error := p_ic^c − p_ic^sla
    dE := Error − Error_old
    E := E + Error
    u := k_p · Error + k_d · dE + k_i · E
    Δ := Δ + u
  end while
end procedure


Figure 9: Adaptive Control Loop for GeoPCAP Consistency SLA (QUICKEST Composition). 
tunable sleep time at the end of each iteration (1 sec in Section 5.5 simulations). Initially the 
geo-delay Δ is set to zero. At each iteration of the loop, we use the composition rules to estimate 
$p_{ic}^c(t)$, where $t = t_c^{sla} + t_p^G + \Delta$. We also keep track of composed $p_{ua}^c()$ values. We then compute the 
error, as the difference between the current composed pic and the SLA. Finally the geo-delay change is 
computed using the PID control law [2] as follows: 

$$u = k_p \cdot Error(t) + k_d \cdot \frac{dError(t)}{dt} + k_i \cdot \int Error(t)\,dt \tag{20}$$

Here, $k_p$, $k_d$, $k_i$ represent the proportional, differential, and integral gain factors for the PID 
controller respectively. There is a vast amount of literature on tuning these gain factors for different 
control systems [2]. Later in our experiments, we discuss how we set these factors to get SLA 
convergence. Finally at the end of the iteration, we increase Δ by u. Note that u could be 
negative, if the metric is less than the SLA. 


Note that for the single data-center PCAP system, we used a multiplicative control loop 
(Section 3.3), which outperformed the unit step size policy. For GeoPCAP, we employ a PID control 
approach. PID is preferable to the multiplicative approach, since it guarantees fast convergence, 
and can reduce oscillation to arbitrarily small amounts. However, PID’s stability depends on proper 
tuning of the gain factors, which can result in high management overhead. On the other hand, the 
multiplicative control loop has a single tuning factor (the multiplicative factor), so it is easier 
to manage. Later in Section 5.5 we experimentally compare the PID and multiplicative control 
approaches. 
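
A discrete-time version of the PID update used for the geo-delay can be sketched as below. The class name, the gain values in the usage note, and the explicit bookkeeping of the previous error are assumptions for illustration; the actual gains would come from tuning.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    kp: float                # proportional gain
    kd: float                # differential gain
    ki: float                # integral gain
    integral: float = 0.0    # running sum of errors (E in the control loop)
    prev_error: float = 0.0  # error from the previous iteration

    def update(self, p_ic_composed: float, p_ic_sla: float) -> float:
        """Return the change u to apply to the geo-delay in this iteration."""
        error = p_ic_composed - p_ic_sla
        d_error = error - self.prev_error  # discrete derivative term
        self.integral += error             # discrete integral term
        self.prev_error = error
        return self.kp * error + self.kd * d_error + self.ki * self.integral
```

A caller would keep its own geo-delay value and add the returned `u` after each measurement, for example `delta += pid.update(p_ic_c, p_ic_sla)`; `u` is negative when the composed metric is below the SLA, which shrinks the delay.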


5 Experiments 

5.1 Implementation Details 

In this section, we discuss how support for our consistency and latency SLAs can be easily 
incorporated into the Cassandra and Riak key-value stores (in a single data-center) via minimal 
changes. 


5.1.1 PCAP Coordinator 


From Section 3.3, recall that the PCAP Coordinator runs an infinite loop that continuously injects 
operations, collects logs (k = 100 operations by default), calculates metrics, and changes the control 
knob. We implemented a modular PCAP Coordinator using Python (around 100 LOC), which can 
be connected to any key-value store. 

We integrated PCAP into two popular NoSQL stores: Apache Cassandra [39] and Riak; 
each of these required changes to about 50 lines of original store code. 


5.1.2 Apache Cassandra 

First, we modified Cassandra v1.2.4 to add read delay and read repair rate as control knobs. 
We changed the Cassandra Thrift interface so that it accepts read delay as an additional parameter. 
Incorporating the read delay into the read path required around 50 lines of Java code. 

Read repair rate is specified as a column family configuration parameter, and thus did not require 
any code changes. We used YCSB’s Cassandra connector as the client, modified appropriately to 
talk with the clients and the PCAP Coordinator. 
5.1.3 Riak 


We modified Riak v1.4.2 to add read delay and read repair as control knobs. Due to the 
unavailability of a YCSB Riak connector, we wrote a separate YCSB client for Riak from scratch (250 
lines of Java code). We decided to use YCSB instead of existing Riak clients, since YCSB offers 
flexible workload choices that model real-world key-value store workloads. 

We introduced a new system-wide parameter for read delay, which was passed via the Riak 
HTTP interface to the Riak coordinator, which in turn applied it to all queries that it received from 
clients. This required about 50 lines of Erlang code in Riak. Like Cassandra, Riak also has built-in 
support for controlling read repair rate. 


5.2 Experiment Setup 


Our experiments are in three stages: microbenchmarks for a single data-center (Section 5.3), 
deployment experiments for a single data-center (Section 5.4), and a realistic simulation for the 
geo-distributed setting (Section 5.5). We first discuss the experiments for a single data-center 
setting. 


Our single data-center PCAP Cassandra system and our PCAP Riak system were each run with 
their default settings. We used YCSB v0.1.4 to send operations to the store. YCSB generates 
synthetic workloads for key-value stores and models real-world workload scenarios (e.g., Facebook 
photo storage workload). It has been used to benchmark many open-source and commercial 
key-value stores, and is the de facto benchmark for key-value stores [15]. 

Each YCSB experiment consisted of a load phase, followed by a work phase. Unless otherwise 
specified, we used the following YCSB parameters: 16 threads per YCSB instance, 2048 B values, 
and a read-heavy distribution (80% reads). We had as many YCSB instances as the cluster size, 
one co-located at each server. The default key size was 10 B for both Cassandra and Riak. Both the 
YCSB-Cassandra and YCSB-Riak connectors were used with the weakest quorum settings and 3 replicas 
per key. The default throughput was 1000 ops/s. All operations use a consistency level of ONE. 

Both PCAP systems were run in a cluster of 9 d710 Emulab servers [60], each with 4-core 
Xeon processors, 12 GB RAM, and 500 GB disks. The default network topology was a LAN (star 
topology), with 100 Mbps bandwidth and inter-server round-trip delay of 20 ms, dynamically 
controlled using traffic shaping. 


We used NTP to synchronize clocks within 1 ms. This is reasonable since we are limited to a 
single data-center. This clock skew can be made tighter by using atomic or GPS clocks. This 
synchronization is needed by the PCAP Coordinator to compute the SLA metrics. 


5.3 Microbenchmark Experiments (Single Data-center) 
5.3.1 Impact of Control Knobs on Consistency 


We study the impact of two control knobs on consistency: read delay and read repair rate. 

Fig. 10 shows the inconsistency metric pic against tc for different read delays. This shows that 
when applications desire fresher data (left half of the plot), read delay is a flexible knob to control 
inconsistency pic. When the freshness requirements are lax (right half of the plot), the knob is less 
useful. However, pic is already low in this region. 


On the other hand, read repair rate has a relatively smaller effect. We found that a change in 
read repair rate from 0.1 to 1 altered pic by only 15%, whereas Fig. 10 showed that a 15 ms increase 
in read delay (at tc = 0 ms) lowered inconsistency by over 50%. As mentioned earlier, using read 
repair rate requires calculating pic over logs of at least k = 3000 operations, whereas read delay 
worked well with k = 100. Henceforth, by default we use read delay as our sole control knob. 

Figure 10: Effectiveness of Read Delay knob in PCAP Cassandra. Read repair rate fixed at 0.1. 

5.3.2 PCAP vs. PBS 



[Plot: x-axis t (ms), ranging from 0 to 50.] 

Figure 11: PCAP vs. PBS consistency metrics (pic). Read repair rate set to 0.1, 50% writes. 


Fig. 11 compares, for a 50%-write workload, the probability of inconsistency against t for both 
existing work PBS (t-visibility) [7] and PCAP (t-freshness) described in Section 2.1. We observe 
that PBS’s reported inconsistency is lower compared to PCAP. This is because PBS considers a 
read that returns the value of an in-flight write (overlapping read and write) to be always fresh, 
by default. However, the comparison between PBS and PCAP metrics is not completely fair, since 
the PBS metric is defined in terms of write operation end times, whereas our PCAP metric is 
based on write start times. It should be noted that the purpose of this experiment is not to show 
which metric captures client-centric consistency better. Rather, our goal is to demonstrate that our 
PCAP system can be made to run by using the PBS t-visibility metric instead of PCAP t-freshness. 


5.3.3 PCAP Metric Computation Time 

Fig. 12 shows the total time for the PCAP Coordinator to calculate the pic and pua metrics for values 
of k from 100 to 10K, and using multiple threads. We observe low computation times of around 1.5 
s, except when there are 64 threads and a 10K-sized log: in this situation, the system starts to 
degrade as too many threads contend for relatively few memory resources. Henceforth, the PCAP 
Coordinator by default uses a log size of k = 100 operations and 16 threads. 
[Plot: x-axis Number of threads.] 

Figure 12: PCAP Coordinator time taken to both collect logs and compute pic and pua in PCAP 
Cassandra. 

Figure 13: Deployment Experiments: Summary of Settings and Parameters. 


5.4 Deployment Experiments 

We now subject our two PCAP systems to network delay variations and YCSB query workloads. 
In particular, we present two types of experiments: 1) sharp network jump experiments, where 
the network delay at some of the servers changes suddenly, and 2) lognormal experiments, which 
inject continuously-changing and realistic delays into the network. Our experiments use ε < 0.05 
(Section 3.3). 

Fig. 13 summarizes the SLA parameters and network conditions used in our 
experiments. 


5.4.1 Latency SLA under Sharp Network Jump 


Fig. 14 shows the timeline of a scenario for PCAP Cassandra using the following latency SLA: 
$p_{ua}^{sla}$ = 0.2375, tc = 0 ms, ta = 150 ms. 

In the initial segment of this run (t = 0 s to t = 800 s) the network delays are small; the one-way 
server-to-LAN switch delay is 10 ms (this is half the machine-to-machine delay, where a machine 
can be either a client or a server). After the warm-up phase, by t = 400 s, Fig. 14 shows that pua 
has converged to the target SLA. Inconsistency pic stays close to zero. 


We wish to measure how close the PCAP system is to the optimal-achievable 
envelope. The envelope captures the lowest possible values for consistency (pic, tc) and latency (pua, 
ta) allowed by the network partition model (α, tp), as given by the corresponding impossibility theorem. We do this by first calculating α 
for our specific network, then calculating the optimal achievable non-SLA metric, and finally seeing 
how close our non-SLA metric is to this optimal. 
First, from the impossibility theorem we know that the achievability region requires tc + ta ≥ tp; hence, we set 
tp = tc + ta. Based on this, and the probability distribution of delays in the network, we calculate 
analytically the exact value of α as the fraction of client pairs whose propagation delay exceeds tp 
(see the definition of the partition model). 

Given this value of α at time t, we can calculate the optimal value of pic as pic(opt) = max(0, α − pua). 
Fig. 14 shows that in the initial part of the plot (until t = 800 s), the value of α is close to 0, 
and the pic achieved by PCAP Cassandra is close to optimal. 
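
As a quick worked example of this formula, the Python sketch below plugs in the values reported for this experiment: α = 0.42 after the network jump, and pua taken equal to the latency SLA of 0.2375. Treating pua as exactly equal to the SLA is an assumption made only for the illustration, and the function name is hypothetical.

```python
def optimal_p_ic(alpha: float, p_ua: float) -> float:
    """Lowest achievable inconsistency for a given partition fraction and p_ua."""
    return max(0.0, alpha - p_ua)


if __name__ == "__main__":
    print(optimal_p_ic(0.0, 0.2375))   # before the jump, alpha close to 0
    print(optimal_p_ic(0.42, 0.2375))  # after the jump, roughly 0.18
```

Before the jump the optimum is zero, which is why the measured pic can stay close to zero; after the jump some inconsistency is unavoidable as long as the latency SLA is held.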

At time t = 800 s in Fig. 14, we sharply increase the one-way server-to-LAN delay for 5 out of 
9 servers from 10 ms to 26 ms. This sharp network jump results in a lossier network, as shown by 
the value of α going up from 0 to 0.42. As a result, the value of pua initially spikes; however, the 
PCAP system adapts, and by time t = 1200 s the value of pua has converged back to under the 
SLA. 

However, the high value of α (= 0.42) implies that the optimal-achievable pic(opt) is also higher 
after t = 800 s. Once again we notice that pic converges in the second segment of Fig. 14 by 
t = 1200 s. 


To visualize how close the PCAP system is to the optimal-achievable envelope, Fig. 15 shows 
the two achievable envelopes as piecewise linear segments (named “before jump” and “after jump”) 
and the (pua, pic) data points from our run in Fig. 14. The figure annotates the clusters of data 
points by their time interval. We observe that the stable states, both before the jump (dark 
circles) and after the jump (empty triangles), are close to their optimal-achievable envelopes. 




Figure 14: Latency SLA with PCAP Cassandra 
under Sharp Network Jump at 800 s: Timeline. 


Figure 15: Latency SLA with PCAP Cassandra 
under Sharp Network Jump: Consistency-Latency 
Scatter plot. 


Fig. 16 shows the CDF plot for pua and pic in the steady state time interval [400 s, 800 s] of 
Fig. 14, corresponding to the bottom left cluster from Fig. 15. We observe that pua is always below 
the SLA. 

Fig. 17 shows a scatter plot for our PCAP Riak system under a latency SLA ($p_{ua}^{sla}$ = 0.2375, 
ta = 150 ms, tc = 0 ms). The sharp network jump occurs at time t = 4300 s when we increase the 
one-way server-to-LAN delay for 4 out of the 9 Riak nodes from 10 ms to 26 ms. It takes about 
1200 s for pua to converge to the SLA (at around t = 1400 s in the warm-up segment and t = 5500 s 
in the second segment). 

Figure 16: Latency SLA with PCAP Cassandra under Sharp Network Jump: Steady State CDF 
[400 s, 800 s]. 


5.4.2 Consistency SLA under Sharp Network Jump 
Figure 17: Latency SLA with PCAP 
Riak under Sharp Network Jump: 
Consistency-Latency Scatter plot. 



Figure 18: Consistency SLA with PCAP 
Cassandra under Sharp Network Jump: 
Consistency-Latency Scatter plot. 


We present consistency SLA results for PCAP Cassandra (PCAP Riak results are similar and 
are omitted). We use $p_{ic}^{sla}$ = 0.15, tc = 0 ms, ta = 150 ms. The initial one-way server-to-LAN 
delay is 10 ms. At time 750 s, we increase the one-way server-to-LAN delay for 5 out of 9 nodes to 
14 ms. This changes α from 0 to 0.42. 

Fig. 18 shows the scatter plot. First, observe that the PCAP system meets the consistency 
SLA requirements, both before and after the jump. Second, as network conditions worsen, the 
optimal-achievable envelope moves significantly. Yet the PCAP system remains close to the 
optimal-achievable envelope. The convergence time is about 100 s, both before and after the jump. 

5.4.3 Experiments with Realistic Delay Distributions 

This section evaluates the behavior of PCAP Cassandra and PCAP Riak under continuously- 
changing network conditions and a consistency SLA (latency SLA experiments yielded similar 
results and are omitted). 
Figure 19: Consistency SLA with PCAP 
Cassandra under Lognormal delay distribution: 
Timeline. 



Figure 20: Consistency SLA with PCAP 
Cassandra under Lognormal delay distribution: 
Consistency-Latency Scatter plot. 


Based on studies for enterprise data-centers m we use a lognormal distribution for injecting 
packet delays into the network. We modified the Linux traffic shaper to add lognormally distributed 
delays to each packet. Fig. 19 shows a timeline where initially (t = 0 to 800 s) the delays are 
lognormally distributed, with the underlying normal distributions of μ = 3 ms and σ = 0.3 ms. At 
t = 800 s we increase μ and σ to 4 ms and 0.4 ms respectively. Finally at around 2100 s, μ and 
σ become 5 ms and 0.5 ms respectively. Fig. 20 shows the corresponding scatter plot. We observe 
that in all three time segments, the inconsistency metric pic i) stays below the SLA, and ii) upon 
a sudden network change converges back to the SLA. Additionally, we observe that pua converges 
close to its optimal achievable value. 
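
The delay injection described above can be illustrated with a short sketch. It is not the modified traffic shaper itself: it only draws per-packet delays from a lognormal distribution in user space. We assume here that μ and σ parameterize the underlying normal distribution in log space (one possible reading of the text), and the names `sample_lognormal_delays_ms`, `SEGMENTS`, and `parameters_at` are hypothetical.

```python
import math
import random
import statistics


def sample_lognormal_delays_ms(mu, sigma, n, seed=None):
    """Draw n packet delays (ms) whose logarithm is Normal(mu, sigma).

    Assumption: mu and sigma are the log-space parameters of the
    underlying normal distribution.
    """
    rng = random.Random(seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


# Hypothetical schedule mirroring the three time segments of the timeline:
# (segment start in seconds, mu, sigma).
SEGMENTS = [(0, 3.0, 0.3), (800, 4.0, 0.4), (2100, 5.0, 0.5)]


def parameters_at(t_seconds):
    """Return (mu, sigma) of the segment that is active at time t_seconds."""
    mu, sigma = SEGMENTS[0][1], SEGMENTS[0][2]
    for start, seg_mu, seg_sigma in SEGMENTS:
        if t_seconds >= start:
            mu, sigma = seg_mu, seg_sigma
    return mu, sigma


if __name__ == "__main__":
    mu, sigma = parameters_at(100)
    delays = sample_lognormal_delays_ms(mu, sigma, 10001, seed=1)
    # The median of a lognormal distribution is exp(mu).
    print(statistics.median(delays), math.exp(mu))
```

Under this reading, each jump in μ shifts the median delay multiplicatively, which is one way to produce the step changes visible in a timeline of this kind.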


Fig. 21 shows the effect of worsening network conditions on PCAP Riak. At around t = 1300 s 
we increase μ from 1 ms to 4 ms, and σ from 0.1 ms to 0.5 ms. The plot shows that it takes PCAP 
Riak an additional 1300 s to have inconsistency pic converge to the SLA. Further, the non-SLA 
metric pua converges close to the optimal. 



Figure 21: Consistency SLA with PCAP Riak under Lognormal delay distribution: Timeline. 

So far all of our experiments used lax timeliness requirements (ta = 150 ms, 200 ms), and 
were run on top of relatively high delay networks. Next we perform a stringent consistency SLA 
experiment (tc = 0 ms, pic = 0.125) with a very tight latency timeliness requirement (ta = 25 ms). 
Packet delays are still lognormally distributed, but with lower values. Fig. 22 shows a timeline where 
initially the delays are lognormally distributed with μ = 1 ms, σ = 0.1 ms. At time t = 160 s we 
increase μ and σ to 1.5 ms and 0.15 ms respectively. Then at time t = 320 s, we decrease μ and 
σ to return to the initial network conditions. We observe that in all three time segments, pic stays 
below the SLA, and quickly converges back to the SLA after a network change. Since the network 
delays are very low throughout the experiment, α is always 0. Thus the optimal pua is also 0. We 
observe that pua converges very close to optimal before the first jump and after the second jump 
(μ = 1 ms, σ = 0.1 ms). In the middle time segment (t = 160 to 320 s), pua degrades in order 
to meet the consistency SLA under slightly higher packet delays. Fig. 23 shows the corresponding 
scatter plot. We observe that the system is close to the optimal envelope in the first and last time 
segments, and the SLA is always met. We note that we are far from optimal in the middle time 
segment, when the network delays are slightly higher. This shows that when the network conditions 
are relatively good, the PCAP system is close to the optimal envelope, but when situations worsen 
we move away. The gap between the system performance and the envelope indicates that the bound 
(Theorem]^ could be improved further. We leave this as an open question. 





Figure 22: Consistency SLA with PCAP Cassandra under Lognormal delay: 
Timeline (tc = 0 ms, pic = 0.125, ta = 25 ms). 
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Figure 23: Consistency SLA with PCAP 
Cassandra under Lognormal delay: 

Scatter Plot (tc = 0 ms, pic = 0.125, ta = 25 ms). 


5.4.4 Effect of Read Repair Rate Knob 

All of our deployment experiments use read delay as the only control knob. Fig. 24 shows a portion 
of a run when only read repair rate was used by our PCAP Cassandra system. This was because 
read delay was already zero, and we needed to push pic up to its SLA value. First we notice that pua does 
not change with read repair rate, as expected (Table. [^. Second, we notice that the convergence 
of pic is very slow - it changes from 0.25 to 0.3 over a long period of 1000 s. 

Due to this slow convergence, we conclude that read repair rate is useful only when network 
delays remain relatively stable. Under continuously changing network conditions (e.g., a lognormal 
distribution) convergence may be slower and thus read delay should be used as the only control 
knob. 
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Figure 24: Effect of Read Repair Rate on PCAP Cassandra, pic = 0.31, tc = 0 ms, ta = 100 ms. 


5.4.5 Scalability 
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Figure 25: Scatter plot for same settings as Fig. 20, but with 32 servers and 16K ops/s. 


We measure scalability via an increased workload on PCAP Cassandra. Compared to Fig. 20, 
in this new run we increased the number of servers from 9 to 32, and throughput to 16000 ops/s, 
and ensured that each server stores at least some keys. All other settings are unchanged compared 
to Fig. 20. The result is shown in Fig. 25. Compared with Fig. 20, we observe an improvement with 
scale - in particular, increasing the number of servers brings the system closer to optimal. 


5.4.6 Effect of Timeliness Requirement 


The timeliness requirements in an SLA directly affect how close the PCAP system is to the optimal- 
achievable envelope. Fig. 26 shows the effect of varying the timeliness parameter ta in a consistency 
SLA (tc = 0 ms, pic = 0.135) experiment for PCAP Cassandra with 10 ms node to LAN delays. 
For each ta, we consider the cluster of the (pua, pic) points achieved by the PCAP system in its 
stable state, calculate its centroid, and measure (and plot on vertical axis) the distance d from this 
centroid to the optimal-achievable consistency-latency envelope. Note that the optimal envelope 
calculation also involves ta, since α depends on it (Section 5.4.1). 

Fig. 26 shows that when ta is too stringent (< 100 ms), the PCAP system may be far from 
the optimal envelope even when it satisfies the SLA. In the case of Fig. 26, this is because in our 
network, the average time to cross four hops (client to coordinator to replica, and the reverse) is 
20 × 4 = 80 ms.^ As ta starts to go beyond this (e.g., ta > 100 ms), the timeliness requirements 
are less stringent, and PCAP is essentially optimal (very close to the achievable envelope). 
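
The distance d used in this experiment can be sketched as follows. This is a minimal illustration, not the evaluation code: we assume the optimal-achievable envelope is available as a polyline of sampled (pua, pic) points, whereas the actual envelope comes from the theorems and is not reproduced here. The function names are hypothetical.

```python
import math


def centroid(points):
    """Centroid of a non-empty list of (pua, pic) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_envelope(points, envelope):
    """Distance d from the centroid of points to a sampled envelope.

    Assumption: envelope is a list of at least two (pua, pic) samples,
    ordered along the curve and joined by straight segments.
    """
    c = centroid(points)
    return min(
        point_segment_distance(c, envelope[i], envelope[i + 1])
        for i in range(len(envelope) - 1)
    )
```

A finer sampling of the envelope gives a closer approximation of the true distance, at the cost of more segment checks per centroid.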
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Figure 26: Effect of Timeliness Requirement (ta) on PCAP Cassandra. Consistency SLA with 
pic = 0.135, tc = 0 ms. 


5.4.7 Passive Measurement Approach 

So far all our experiments have used the active measurement approach. In this section, we repeat a 
PCAP Cassandra consistency SLA experiment (pic = 0.2, tc = 0 ms) using a passive measurement 
approach. 

In Figure 27, instead of actively injecting operations, we sample ongoing client operations. We 
estimate pic and pua from the 100 latest operations from 5 servers selected randomly. 

At the beginning, the delay is lognormally distributed with μ = 1 ms, σ = 0.1 ms. The passive 
approach initially converges to the SLA. We change the delay (μ = 2 ms, σ = 0.2 ms) at t = 325 s. 
We observe that, compared to the active approach, 1) consistency (SLA metric) oscillates more, 
and 2) the availability (non-SLA metric) is farther from optimal and takes longer to converge. For 
the passive approach, SLA convergence and non-SLA optimization depend heavily on the sampling 
of operations used to estimate the metrics. Thus we conclude that it is harder to satisfy the SLA and 
optimize the non-SLA metric with the passive approach. 
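
The passive estimation step can be sketched as below. This is an illustration under stated assumptions, not the PCAP implementation: we assume each sampled operation records a timestamp, its latency, and whether it returned a fresh value, and that the 100 latest operations are taken across the 5 chosen servers combined. The `Op` record and `estimate_metrics` function are hypothetical names.

```python
import random
from typing import NamedTuple


class Op(NamedTuple):
    """One sampled client operation (hypothetical record layout)."""
    timestamp: float   # completion time, in seconds
    latency_ms: float  # observed operation latency
    fresh: bool        # True if the value returned met the freshness bound


def estimate_metrics(ops_by_server, ta_ms, num_servers=5, window=100, rng=None):
    """Estimate (pic, pua) from the latest sampled operations.

    Picks num_servers servers at random, pools their operations, and
    keeps the `window` most recent ones. Returns None if there are none.
    """
    rng = rng or random.Random()
    servers = sorted(ops_by_server)
    chosen = rng.sample(servers, min(num_servers, len(servers)))
    pooled = [op for server in chosen for op in ops_by_server[server]]
    latest = sorted(pooled, key=lambda op: op.timestamp)[-window:]
    if not latest:
        return None
    # pic: fraction of sampled operations that were not fresh.
    pic = sum(not op.fresh for op in latest) / len(latest)
    # pua: fraction of sampled operations slower than the timeliness bound.
    pua = sum(op.latency_ms > ta_ms for op in latest) / len(latest)
    return pic, pua
```

Because the estimate rests on only 100 samples, a handful of slow or stale operations can move it noticeably, which is consistent with the larger oscillation reported for the passive approach.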

^ Round-trip time for each hop is 2 × 10 = 20 ms. 

Figure 27: Consistency SLA with PCAP Cassandra under Lognormal delay distribution: Timeline 
(Passive). 

5.5 GeoPCAP Evaluation 

We evaluate GeoPCAP with a Monte-Carlo simulation. In our setup, we have four data-centers, 
among which three are remote data-centers holding replicas of a key, and the fourth one is the 
local data-center. At each iteration, we estimate t-freshness per data-center using a variation of the 
well-known WARS model [7]. The WARS model is based on Dynamo style quorum systems [21], 
where data staleness is due to read and write message reordering. The model has four components. 
W represents the message delay from coordinator to replica. The acknowledgment from the replica 
back to the coordinator is modeled by a random variable A. The read message delay from 
coordinator to replica, and the acknowledgment back are represented by R, and S, respectively. A read 
will be stale if a read is acknowledged before a write reaches the replica, i.e., R + S < W + A. 
In our simulation, we ignore the A component since we do not need to wait for a write to finish 
before a read starts. We use the LinkedIn SSD disk latency distribution ([7], Table 3) for read/write 
operation latency values. 
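
The staleness condition above lends itself to a small Monte-Carlo sketch. With the A component ignored, a read counts as stale when R + S < W. The exponential delay distributions and their means below are placeholders chosen for illustration only; they are not the LinkedIn SSD distribution used in our simulation, and the function names are hypothetical.

```python
import random


def exponential(mean_ms):
    """Return a sampler drawing exponential delays with the given mean (ms)."""
    return lambda rng: rng.expovariate(1.0 / mean_ms)


def estimate_stale_probability(sample_w, sample_r, sample_s,
                               trials=10000, seed=None):
    """Estimate the probability that a read is stale.

    Each sampler takes a random.Random instance and returns a delay in ms.
    The A component is ignored, so the staleness test is R + S < W.
    """
    rng = random.Random(seed)
    stale = 0
    for _ in range(trials):
        w = sample_w(rng)
        r = sample_r(rng)
        s = sample_s(rng)
        if r + s < w:
            stale += 1
    return stale / trials


if __name__ == "__main__":
    # Placeholder means (ms); assumptions for illustration only.
    p = estimate_stale_probability(
        exponential(2.0), exponential(1.0), exponential(1.0), seed=42
    )
    print(f"estimated stale-read probability: {p:.3f}")
```

Passing the samplers as arguments keeps the staleness test separate from the choice of latency distribution, so an empirical distribution could be substituted without changing the loop.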

We model the WAN delay using a normal distribution N(20 ms, √2 ms) based on results 
from [9]. Each simulation runs for 300 iterations. At each iteration, we run the PID control 
loop (Figure to estimate a new value for geo-delay A, and sleep for 1 sec. All reads in the 
following iteration are delayed at the local data-center by A. At iteration 150, we inject a jump 
by increasing the mean and standard deviation of each WAN link delay normal distribution to 
22 ms and √2.2 ms, respectively. We show only results for consistency and latency SLA for the 
ALL composition. The QUICKEST composition results are similar and are omitted. 

Figure 28: GeoPCAP SLA Timeline for L SLA (ALL). 

Figure 29: GeoPCAP SLA Timeline for C SLA (ALL). 


Figure 28 shows the timeline of SLA convergence for the GeoPCAP latency SLA (pua = 0.27, …, 
… = 0.1 ms). We observe that using the PID controller (kp = 1, kd = 0.5, ki = 0.5), 
both the SLA and the other metric converge within 5 iterations initially and also after the jump. 
Figure 30 shows the corresponding evolution of the geo-delay control knob. Before the jump, the 
read delay converges to around 5 ms. After the jump, the WAN delay increase forces the geo-delay 
to converge to a lower value (around 3 ms) in order to meet the latency SLA. 

Figure 30: Geo-delay Timeline for L SLA (ALL) (Figure 28). 

Figure 31: Geo-delay Timeline for C SLA (ALL) (Figure 29). 


Figure 29 shows the consistency SLA (pic = 0.38, tc = 1 ms, ta = 25 ms) timeline. Here 
convergence takes 25 iterations, thanks to the PID controller (kp = 1, kd = 0.8, ki = 0.5). We 
needed a slightly higher value for the differential gain kd to deal with increased oscillation for the 
consistency SLA experiment. Note that the pic SLA value of 0.38 forces a smaller per data-center pic 
convergence. The corresponding geo-delay evolution (Figure 31) initially converges to around 3 ms 
before the jump, and converges to around 5 ms after the jump, to enforce the consistency SLA 
after the delay increase. 
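
For readers unfamiliar with PID control, the following is a minimal sketch of a textbook discrete PID update using the gain names kp, ki, and kd from the experiments above. It is not the GeoPCAP control loop itself. How the controller output is mapped to the geo-delay, the sign convention of the error, and the clamping at zero are assumptions made for illustration, and the names `PIDController` and `next_geo_delay` are hypothetical.

```python
class PIDController:
    """Textbook discrete PID controller with a unit time step."""

    def __init__(self, kp, ki, kd):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.integral = 0.0    # running sum of past errors
        self.prev_error = 0.0  # error seen in the previous iteration

    def update(self, error):
        """Return the control output for the current error."""
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


def next_geo_delay(current_delay_ms, adjustment_ms):
    """Apply an adjustment to the geo-delay knob (assumed additive).

    The delay is clamped at zero because a negative read delay is not
    meaningful.
    """
    return max(0.0, current_delay_ms + adjustment_ms)


if __name__ == "__main__":
    # Gains from the latency SLA experiment.
    pid = PIDController(kp=1.0, ki=0.5, kd=0.5)
    # Assumed convention: error = SLA target minus measured metric.
    error = 0.27 - 0.20
    print(next_geo_delay(5.0, pid.update(error)))
```

The derivative term reacts to the change in error between iterations, which is why a larger kd can damp oscillation of the kind noted for the consistency SLA experiment.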



Figure 32: Geo-delay Timeline for L SLA (ALL) with Multiplicative Approach. 


We also repeated the Latency SLA experiment with the ALL composition (Figures 28 and 30) using 
the multiplicative control approach (Section 3.3) instead of the PID control approach. Figure 32 
shows the corresponding geo-delay trend compared to Figure 30. Comparing the two figures, we 
observe that although the multiplicative strategy converges as fast as the PID approach both before 
and after the delay jump, the read delay value keeps oscillating around the optimal value. Such 
oscillations cannot be avoided in the multiplicative approach, since at steady state the control loop 
keeps changing direction with a unit step size. Compared to the multiplicative approach, the PID 
control approach is smoother and has fewer oscillations. 
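
The steady-state oscillation described above can be reproduced with a toy sketch. It models only the final phase of such a loop, where the knob moves by one unit in whichever direction the SLA check indicates. The linear relation between the knob and the metric is a made-up stand-in for the real system, and the function name is hypothetical.

```python
def unit_step_trace(metric_of, target, start_knob, iterations):
    """Trace a knob that always moves by one unit toward the target.

    metric_of maps a knob value to the measured SLA metric. The knob is
    increased while the metric is below the target and decreased
    otherwise, so it can never settle on a single value.
    """
    knob = start_knob
    trace = []
    for _ in range(iterations):
        if metric_of(knob) < target:
            knob += 1
        else:
            knob -= 1
        trace.append(knob)
    return trace


if __name__ == "__main__":
    # Toy plant (assumption): the metric grows linearly with the knob.
    trace = unit_step_trace(lambda k: 0.05 * k, target=0.27,
                            start_knob=0, iterations=10)
    print(trace)
```

In this toy setting the knob climbs to the boundary and then alternates between the two values on either side of it. A controller whose step shrinks with the error, such as a PID controller, avoids this behavior.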
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6 Related Work 


6.1 Consistency-Latency Tradeoffs 

There has been work on theoretically characterizing the tradeoff between latency and strong 
consistency models. Attiya and Welch [S] studied the tradeoff between latency and linearizability and 
sequential consistency. Subsequent work has explored linearizability under different delay 
models [231 132]. All these papers are concerned with strong consistency models whereas we consider 
t-freshness, which models data freshness in eventually consistent systems. Moreover, their delay 
models are different from our partition model. There has been theoretical work on probabilistic 
quorum systems [381 SHE]. Their consistency models are different from ours; moreover, they did 
not consider the tradeoff between consistency and availability. 

There are two classes of systems that are closest to our work. The first class of systems is 
concerned with metrics for measuring data freshness or staleness. We do not compare our work 
against this class of systems in this paper, as it is not our goal to propose yet another consistency 
model or metric. Bailis et al. [SIEI propose a probabilistic consistency model (PBS) for quorum- 
based stores, but did not consider latency, soft partitions or the CAP theorem. Golab et al. [3T] 
propose a time-based staleness metric called Δ-atomicity. Δ-atomicity is considered the gold 
standard for measuring atomicity violations (staleness) across multiple read and write operations. The 
Γ metric [32] is inspired by the Δ metric and improves upon it on multiple fronts. For example, the 
Γ metric makes fewer technical assumptions than the Δ metric and produces less noisy results. It 
is also more robust against clock skew. All these related data freshness metrics cannot be directly 
compared to our t-freshness metric. The reason is that unlike our metric which considers write start 
times, these existing metrics consider end time of write operations when calculating data freshness. 

The second class of systems deals with adaptive mechanisms for meeting consistency-latency 
SLAs for key-value stores. The Pileus system [58] considers families of consistency/latency SLAs, 
and requires the application to specify a utility value with each SLA. In comparison, PCAP 
considers probabilistic metrics of pic, pua. Tuba [Hj extends the predefined and static Pileus mechanisms 
with dynamic replica reconfiguration mechanisms to maximize Pileus style utility functions without 
impacting client read and write operations. Golab and Wylie [33] propose consistency 
amplification, which is a framework for supporting consistency SLAs by injecting delays at servers or clients. 
In comparison, in our PCAP system, we only add delays at servers. McKenzie et al. [50] propose 
continuous partial quorums (CPQ), which is a technique to randomly choose between multiple 
discrete consistency levels for fine-grained consistency-latency tuning, and compare CPQ against 
consistency amplification. Compared to all these systems where the goal is to meet SLAs, in our 
work, we also (1) quantitatively characterize the (un)achievable consistency-latency tradeoff 
envelope, and (2) show how to design systems that perform close to this envelope, in addition to (3) 
meeting SLAs. The PCAP system can be set up to work with any of these SLAs listed above; but 
we do not do this in the paper since our main goal is to measure how close the PCAP system is to 
the optimal consistency-latency envelope. 

Recently, there has been work on declarative ways to specify application consistency and latency 
requirements - PCAP proposes mechanisms to satisfy such specifications m. 

6.2 Adaptive Systems 

There are a few existing systems that control consistency in storage systems. FRACS [63] controls 
consistency by allowing replicas to buffer updates up to a given staleness. AQuA [38] continuously 
moves replicas between “strong” and “weak” consistency groups to implement different consistency 
levels. Fox and Brewer [25] show how to trade consistency (harvest) for availability (yield) in 
the context of the Inktomi search engine. While harvest and yield capture continuously 
changing consistency and availability conditions, we characterize the consistency-availability (latency) 
tradeoff in a quantitative manner. TACT m controls staleness by limiting the number of 
outstanding writes at replicas (order error) and bounding write propagation delay (staleness). All 
the mentioned systems provide best-effort behavior for consistency, within the latency bounds. In 
comparison, the PCAP system explicitly allows applications to specify SLAs. Consistency levels 
have been adaptively changed to deal with node failures and network changes in m; however, this 
may be intrusive for applications that explicitly set consistency levels for operations. Artificially 
delaying read operations at servers (similar to our read delay knob) has been used to eliminate 
staleness spikes (improve consistency) which are correlated with garbage collection in a specific 
key-value store (Apache Cassandra) [23]. Similar techniques have been used to guarantee causal 
consistency for client-side applications [62]. Simba [53] proposes new consistency abstractions for 
mobile application data synchronization services, and allows applications to choose among various 
consistency models. 

For stream processing, Gedik et al. [26] propose a control algorithm to compute the optimal 
resource requirements to meet throughput requirements. There has been work on adaptive elasticity 
control for storage [33], and adaptively tuning Hadoop clusters to meet SLAs [37]. Compared to 
the controllers present in these systems, our PCAP controller achieves control objectives [35] using 
a different set of techniques to meet SLAs for key-value stores. 

6.3 Composition 

Composing local policies to form global policies is well studied in other domains, for example QoS 
composition in multimedia networks [43], software defined network (SDN) composition [5T], and 
web-service orchestration [22]. Our composition techniques are aimed at consistency and latency 
guarantees for geo-distributed systems. 

7 Summary 

In this paper, we have first formulated and proved a probabilistic variation of the CAP theorem 
which takes into account probabilistic models for consistency, latency, and soft partitions within 
a data-center. Our theorems show the un-achievable envelope, i.e., which combinations of these 
three models make them impossible to achieve together. We then showed how to design systems 
(called PCAP) that (1) perform close to this optimal envelope, and (2) can meet consistency 
and latency SLAs derived from the corresponding models. We then incorporated these SLAs 
into Apache Cassandra and Riak running in a single data-center. We also extended our PCAP 
system from a single data-center to multiple geo-distributed data-centers. Our experiments with 
YCSB workloads and realistic traffic demonstrated that our PCAP system meets the SLAs, that its 
performance is close to the optimal-achievable consistency-availability envelope, and that it scales 
well. Simulations of our GeoPCAP system also showed SLA satisfaction for applications spanning 
multiple data-centers. 
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